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Abstract 

We consider problems of estimation of structured covariance matrices, and in particular of matrices with a 
Toeplitz structure. We follow a geometric viewpoint that is based on some suitable notion of distance. To this end, 
I we overview and compare several alternatives metrics and divergence measures. We advocate a specific one which 

. represents the Wasserstein distance between the corresponding Gaussians distributions and show that it coincides 

with the so-called Bures/Hellinger distance between covariance matrices as well. Most importantly, besides the 
O . physically appealing interpretation, computation of the metric requires solving a linear matrix inequality (LMI). 

' As a consequence, computations scale nicely for problems involving large covariance matrices, and linear prior 

. constraints on the covariance structure are easy to handle. We compare this transportation/Bures/Hellinger metric 

with the maximum likelihood and the Burg methods as to their performance with regard to estimation of power 
spectra with spectral lines on a representative case study from the literatureQ 

O ■ I- Introduction 

■ Consider a zero-mean, real-valued, discrete-time stationary random process {x{t), t G Z}. Let 

r{t) := E {x{k)x{k + t)) , 

with k,t ^ Z, denote the autocorrelation function, and 



in 



r^i ro ■■■ rn-2 



r-(n-i) r_^ri-2) ■■■ ro 
the covariance of the finite (observation) vector 

x= [x{0), x(l), ... x(n-l)]', 

i.e., T = £'(xx'). The covariance has a Toeplitz structure inherited by the time-invariance (stationarity) of the 
^ process. Throughout, the size of such an observation vector and of corresponding finite Toeplitz matrices will 
$—1 ' always be n and n x n, respectively. 

. The power spectrum of the process is uniquely determined by the (infinite) autocorrelation function. This is due to 
the fact that the trigonometric moment problem is determined [li]. Then, starting with Burg's early contributions lO, 
|[3l . modern nonlinear spectral analysis techniques largely rely on admissible estimates of the partial autocorrelation 
sequence {ro, ri, . . . ,r„_i} (equivalently, of the Toeplitz covariance T) from which information is sought about 
corresponding power spectra. Admissibility of the partial autocorrelation sequence amounts to the requirement that 
T is a positive semi-definite matrix, in which case a positive semi-definite extension to an infinite matrix is also 
possible. 

Part of the challenge, which was already addressed by Burg, is due to the fact that the sample covariance 

^ m 

T:=-J]xfcX^, (1) 

k=l 

where x^ (/c G {1, 2, . . . , m}) are independent observation vectors, may not be Toeplitz due to statistical errors. On 
the other hand, estimates of the individual entries {ro, ri, . . . ,r„_i} via averaging over all available samples to 
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within a given time-distance from one another, may not lead to a positive matrix. Either way, the linear structure 
or the positivity is compromised. 

An early popular algorithm by Burg aimed at ensuring positivity via a clever estimation of the so-called partial 
reflection coefficients instead of the autocorrelation coefficients (see e.g., ||3]). Several alternative tricks were devised 
followed by a maximum likelihood approach in IH. However, the issue was never put to rest because all these 
face challenges of their own that lead to poor resolution, bias, "line-spliting" (where sinusoidal components in the 
spectrum generate ghost peaks), and computational difficulties (as in the case of |I4|). The source of the problem is 
largely the error in T which adversely affects our subsequent estimate of the underlying power spectrum (obtained 
using e.g., a Maximum Entropy method, the Capon envelope, etc.). Herein, we do not analyze the problem of going 
from the Toeplitz covariance to a power spectral estimate. Instead we focus only on the problem of estimating the 
Toeplitz covariance from finite observations. 

The Toeplitz covariance matrix is sought as the one closest to T in a suitable geometry. Notions of distance from 
information theory, quantum mechanics, and statistics lead to complementary viewpoints and so does maximum 
likelihood estimation of the autocorrelation coefficients which also provides us with a notion of distance. In Section 
im we outline the geometric viewpoint together with various possibilities for distance measures. In Section |lll] we 
discuss the respective optimization problems and in Section |IV] we compare the three most promising alternatives 
on a specific example from the literature. 

II. Geometric Viewpoint 
Given a sample covariance matrix T, we consider the problem to minimize 

min(i(r,f), (2) 

over the class of admissible matrices 

r := {T : T > 0, T being Toeplitz}. 

In this, d represents a suitable notion of distance. Various such distance measures are motivated below based 
on statistics, information theory, quantum mechanics, and optimal transportation. Occasionally, when the distance 
measure is not symmetric, we use the notation d{T\\T) instead. Such non-symmetric measures are often referred 
to as divergences in the literature. 



A. Likelihood divergence 

We begin by discussing maximum likelihood estimation Q. Assuming that the process {x{t),t G Z} is Gaussian, 
the joint density function for independent observation vectors (fc G {1, 2, . . . , m}) is 



p(X;r) = (27r)-^|T|-f exp (^"^ x'^T-^x^^ , 



with X := [xi, . . . ,Xm]. Then, T = :^XX' and the log-likelihood function becomes 

/:(f,T) = iogp(X;r) 

^ ^rilog(27r) + log |r| + trace(fT-^)) . (3) 



Thus, it is natural to seek a Toeplitz covariance matrix T for which C{T,T) is maximal. Note that if the x^t's are 
independent Gaussian random variables, mT follows a Wishart distribution. Then Q is the log-likelihood function 
of this distribution. 

Alternatively, one may consider the likelihood divergence 

dL(T||f ) := l(logp(x;f ) - logp(x;r)) 

m 

= ^{- log \f\ + log \T\ + trace(f T"^) - n) 
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as a relevant notion of distance since, evidently, 

rfL(T||r) >o, 
dL(r||f) = o^T = f. 

It relates to the KuUback-Leibler divergence between corresponding pdf 's which is discussed next. However, it does 
not define a metric because it lacks symmetry and may also fail to satisfy the triangular inequality. 



B. Kullback-Leibler divergence 

For random variables on with probability density functions p and p, the Kullback-Leibler (KL) divergence 

c^kl(pIIp) := / p\og[^\dx (4) 



represents a well-accepted notion of distance between the two [61. In the case where p and p are normal with 
zero-mean and covariances T and T, respectively, their KL divergence becomes 



dKL{p\\p) = ^ (log |f I - log |r| + trace(rf-i) - n 



while 



dKL{p\\p) = / plog i-]dx 
Jr" \pJ 



^(- log |f I + log |r| + trace(f'r-^) - n) 
dL{T\\f). 



C. Fisher metric and geodesic distance 

The KL divergence induces a Riemannian structure on the manifold of probability distributions. The quadratic 
term of dY^{p\\p + 8) in the perturbation 5 is the Fisher information metric 

gp,Fisher(^) = J —dx. (5) 

This turns out to be natural from one additional perspective. It is the unique Riemannian metric for which the 
stochastic maps are contractive Q -a property that motivates a rich family of metrics in the context of matricial 
counterparts of probability distributions (see below). 

For probability distributions p{x, 6) parameterized by a vector 6 the corresponding metric is often referred to as 
Fisher-Rao ||8jl and given by 



=p, Fisher-Rao 



{5e) = 6'eE 



dlogp\ f dlogp 



de 



de 



For zero-mean Gaussian distributions parameterized by corresponding covariance matrices the metric becomes 



(6) 



and is often named after C.R. Rao. We summarize this below. Throughout |lM||i;' denotes the Frobenius norm 

^trace (MM'). 

Proposition 1: Consider a zero-mean, normal distribution p with covariance T > 0, and a perturbation p^ with 
covariance T + eA. Provided \\T-^/'^eAT-'^/^\\F < 1, 

1 



Moreover, for S = Pe — p, 



dKL(p|b.) = ^gT,Rao(eA) + 0(e3). 



gT,Rao(eA) =2g„p,,her('^)+0(e4). 
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The proof is given in Appendix |Vll 

The Fisher-Rao metric has been studied extensively in recent years HI, 191. Geodesies and the geodesic distance 
on the respective Riemannian manifolds can be computed explicitly. In fact, on the space of covariance matrices, 
the geodesic distance between two points T and T is precisely the log-deviation 

(iLog(T,f) = ||log(f-l/22^r-V2)||^. 

Two properties that are worth noting is that the metric is congruence invariant and that the corresponding metric 
space is complete. 

D. Bures metric and Bures/Hellinger distance 

As noted earlier, the Fisher information metric is the unique Riemannian metric for which stochastic maps 
are contractive. In quantum mechanics, a similar property has been sought for the non-commutative analog of 
probability vectors, namely, density matrices. These are positive semi-definite and have trace equal to one. In this 
setting, there are several metrics for which stochastic maps (these are now linear maps between spaces of density 
matrices, preserving positivity and trace) are contractive. They take the form 

trace(Ai:>T(A)) 

where Z)r(A) can be thought of as a "non-commutative division" of the matrix A by the matrix T. Thus, if T, A 
are scalars, the above collapses to A^/T. Particular expressions generating such a "non-commutative division" are 

Dt,i{A):=T-^A, (7a) 

POD 

Dt.2{^):= iT + sI)-^A{T + sI)-^ds, (7b) 

^0 

Dt,3{A) := M, where ^(TM + MT) = A, (7c) 

see e.g., ifTOl . The metric corresponding to (TTal) was studied by Petz ifTOl . the metric corresponding to (iTbl ) is 
induced by the von Neumann entropy on density matrices and is known as the Kubo-Mori metric ifTOl . while (iTcl) 
gives rise to the Bures metric. 

The Bures metric can also be written as 

gT,Bures(A) := min{ \\Y\\l \A = YW' + WY',T = WW'}, 
w 

see ifTTI . Accordingly, the corresponding geodesic distance on the manifold of density matrices is called the Bures 
length. Assuming the normalization trace(T) = 1 (i.e., that T > is a density matrix), = 1. Thus, we can 

regard W as an element on a unit sphere. Then, the Bures length is the arc length between corresponding points 
on the sphere. 

The Bures metric has a close connection to the so-called Hellinger distance. A generalization of the standard 
Hellinger distance to matrices proposed in Ferrante etal. 1121 is 

dii{T,f) : = mm ^\\T^U -f^vy \ UU' = I,VV' = /} 

= niin|||r2[/-f2||j. I = /|, (8) 

since, clearly, only one unitary transformation U can attain the same minimal value. This differs from the more stan- 
dard way to generalize the scalar Hellinger distance to matrices which is trace((T^/^ — T^/^)^). The generalization 
in ^ is better known in quantum mechanics literature as the Bures distance when the matrices are normalized to 
have trace 1. It is seen that the Bures/Hellinger distance represents a "straight line" distance between representatives 
of two "points" on the sphere as measured when imbedded in a linear Euclidean space. The representatives amount 
to a selection of suitable points on an equivalence class defined via unitary transformations. 
Interestingly, as shown in lITTI . |[T2l . 

j_ 

dii{T,f) = {iv&ce{T + f - 2{f2Tf^)^)y . 



5 

Also, the optimizing unitary matrix [/ in ([8]l is 

We note that the Hellinger distance appUes equally well to positive definite matrices without any need to normalize 
T and T, and as such, it has been used to compare multivariate power spectral densities |[T2ll . 

E. Transportation distance 

We shift to a seemingly different way of comparing pdf's. The transportation distance quantifies the cost for 
transferring one "mass" distribution to another accounting for the combined cost of moving every unit of mass 
from one location to another. Background on transportation problems goes back to the work of G. Monge in the 
1700's. The recent interest was sparked by the developments in the 1940's by L. Kantorovich who is considered 
the father of the subjecH The importance of transportation distances in probability theory stems from the fact that 
the respective metrics are weakly continuous. 

We consider distributions in R" and a quadratic cost. A formulation of the Monge-Kantorovich transportation 
problem (with a quadratic cost) directly in probabilistic terms is as follows. Let X and Y be random variables in 
M" having pdf's and py. Determine 

dliM^Py) :=inf |i?(|X-y|2) I jj = py,j^p = p^y (9) 

The metric is known as the Wasserstein metric and, quite surprisingly, also induces a Riemannian structure on 
probability densities |[T3l . |[T4l - a rather deep result. Returning to the above optimization, the cost is simply the 
minimum variance when the marginals of the joint distribution are specified. 

We now assume that T and T are the co variances of X and Y, respectively, and we let S = E{XY') denote 
their correlation. Further, assuming that their joint distribution is Gaussian we obtain 

^^w, (PjP) = ™™ 1 trace(T + T — S — S') \ 



T S 
S' f 




A closed form solution is easy to obtain ifTSl . lfT6l : 

5o = f-^(firf^)^f 5, (11) 

and the transportation distance is given alternatively by 

dw,ip,p) = (trace(r + f -2(firri)i))' . 

Since this is central to our theme, we provide details in Appendix IVIII Comparing now with the corresponding 
expression for the Hellinger distance we readily have the following. 

Proposition 2: For p and p Gaussian zero mean distributions with covariances T and T, respectively, 

dB{T,f) = dwAp,?)- 

III. Approximation of structured covariances 

Returning to the structured covariance approximation problem, we consider the computation of the optimizers 
for dUl. We do this for every choice of distance discussed in the previous section. 



^L. Kantorovich received the Nobel prize in Economics in 1975 for his related work on mass transport and resource allocation. 
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A. Approximation based on KL divergence and likelihood 

If we use c^KL given in ([4]) as the distance between T and T, for the approximation problem we need to solve 

. jlog |f I - log |r| + trace(rf~^) - n} . (12) 



mm 



This is convex in T, provided T > 0, and hence numerically feasible. However, the problem is vacuous when T 
is singular. This is unsatisfactory since the case when T is singular is important and quite common. Alternatively, 
if we use the likelihood divergence di^{T\\T) as distance measure, the optimization problem 



mm 



|-log|f| +log|r| +trace(fr-^) - n} (13) 



is well defined for singular T as well. 

A necessary condition for a local minimum of ([T3] ) given in is: 



trace ( (T-^f T"^ - T~^)Q ) = 0, (14) 



for all Toeplitz Q and pointed out in lH that, provided T is not singular, there is at least one local minimum of 
([T3] ) which is positive definite. Based on this. Burg etal llH give a numerical method to solve ([T3] ). The method is 
computationally demanding and numerically sensitive, especially when T is singular. 

B. Approximation based on log-deviation 
The optimization problem 



^{||iog(r-V2rf-i/2)||^}. (15) 

is not convex in T. Linearization of the objective function about T may be used instead, since this leads to 

min|||f ~i/2j^f"^/^ - (16) 

which is a convex problem. 



mm • 



C. Based on Hellinger and transportation distance 

Using du (T, T) the relevant optimization problem Q becomes 



mm 

Ter 



{trace(T + f - 2(firfi)i)} . (17) 



At the outset, this appears difficult. However, from Proposition |2] we know that dii{T,T) = d^2{p,p). Hence, we 
may evaluate ([TT] ) via solving 



min < trace (T + T — S — S') 



T S 
S' f 



>0^ (18) 



This is now a semi-definite program and can be solved quite efficiently ifTTl . 

The above expression for the transportation distance can be given an alternative interpretation as follows. We 
postulate the statistical model 

X = X + V 

where v represents noise, and T and T are the covariances of x and x, respectively. The covariance of x is known 
to be in the admissible set T while that of x may not, due to noise. Thus, in the absence of additional priors, it is 
reasonable to seek an "explanation" of the estimated covariance T by assuming the least possible amount of noise. 
Allowing for possible coupling between x and v brings us to minimize 

^{v'v} = trace(r + f-S-S') 

subject to positive semi-definiteness of the covariance of [x', x']'- This is precisely ([TSl l. 
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Analogous rationale, albeit with different assumptions has been used to justify different methods. For instance, 
assuming that x = x + v where x and v are independent leads to 

min |trace(f - T) | f - T > o| 
which is a method proposed in HI]. Then, also, assuming a "symmetric" noise contribution as in 

X + V = X + V, 

where the noise vectors v and v are independent of x and x, leads to 

min |trace(Q + Q) I f + Q = T + Q, Q,Q>o|, 
T£T,Q,Q ^ J 

where Q and Q designate covariances of v and v, respectively. The minimum in this case is the nuclear norm of 
T — T and studied as a possibility in |[T9ll . 

IV. Examples 

We now compare how well two of the methods outlined earlier perform in identifying a single spectral line 
in white noise and compare those with the standard Burg's method. We choose parameters as in the example in 
Burg etal. H. For constructing power spectra corresponding to a finite set of co variance samples and we use 
autoregressive models in order to be consistent with HI. By using the same type of power spectra we isolate and 
compare the effect of correcting for the "non-Toeplitz-ness" via each of these two methods and by Burg's method. 

The data consists of a sinusoid (leading to a single spectral line) with three different phase values and the same 
random vector for the noise. We assume a single observation vector of size 11, hence both x and v are vectors, 
and thus, the estimated covariance T is (11 x 11), singular, and of rank equal to 1. Thus, the data is the same 
additive mixture of sinusoid and noise as in 111: 

x{t) = cos(^t + ip) + v{t),t = 0, 1, . . . , 10. 

The initial phase t/; is chosen for three different values ^ and The noise vector v is fixed as 

V = [0.000562, - 0.019127, 0.007377, - 0.000149, -0.007479, - 0.013960, 
0.003510, 0.012380, 0.006979, 0.003092, 0.010053]', 

generated from a zero-mean Gaussian distribution with variance 0.0001. 

The first plot in Figure [T] shows the power spectral density (PSD) using Burg's method for estimating the partial 
correlation coefficients (as in H), while the second and third plots are based on covariances approximated using 
the likelihood-based method and transportation-based methods, respectively. The data corresponds to tp = j and 
the resolution of the plots is The (red) arrow in the plots indicates the frequency of the sinusoidal component. 
Burg's method splits the spectral line into three. The spectral line closest to the true (red arrow) is also significantly 
off. On the other hand, both, the likelihood-based and the transportation-based methods detect the spectral line at 
the correct frequency (with relatively insignificant error). 

Figure |2] shows the same situation but for 7/; = | . All three methods detect the spectral line perfectly, to within 
the stated resolution. Figure [3] corresponds to the case where tjj = Burg's method consistently splits the true 
spectral line into two nearby ones. The likelihood-based method gives a small peak near the true spectral line, 
although the dominant line is located at the true frequency to within the stated resolution. On the other hand, the 
transportation-based method gives a result which is consistent with the previous two situations. For the purpose of 
detecting line spectra, the transportation-based method appears to be the most robust. 

A potential drawback of the transportation-based method is that it gives a biased estimate for the energy in the 
sinusoidal component. This is typically smaller than the true value in the example. 
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Fig. 1. Estimated maximum entropy spectrum for -0 = ^:1) Burg's method, ii) Maximum likelihood method, iii) Minimum transportation 
method. 
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Fig. 2. Estimated maximum entropy spectrum for ^ = ^: i) Burg's method, ii) Maximum likelihood method, iii) Minimum transportation 
method. 



V. Recap 

Most modem spectral analysis methods rely on estimated covariance statistics. Yet, they are sensitive to those 
statistics abiding by the requisite linear structure, e.g., Toeplitz. In this paper we discussed and compared two of the 
most promising methods for approximating a sample covariance with one of the required structure. Contributions 
in the paper include drawing the connection between approximation in the Hellinger distance and approximation 
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Fig. 3. Estimated maximum entropy spectrum for 'tp — i) Burg's method, ii) Maximum likelihood method, iii) Minimum transportation 
method. 



in the sense of optimal mass transport. The latter can be cast as a semidefinite program which is easy to solve and 
impervious to possible singularity or near-singularity of the sample covariance. 

The issue with the sample covariance T being singular is often neglected in estimation problems. Yet, it is 
ubiquitous when only few short observation records are available — a situation which is common in the analysis 
of non-stationary processes. Furthermore, the uniqueness and other properties of a maximum likelihood estimate, 
when T is singular, are not well understood lH. 

As a final remark we note that interest in other linear structures for covariance matrices, besides that of Toeplitz, 
arises when the vectorial process is the state vector of a linear system. In such a case, T satisfies linear constraints 
that involve the system dynamics ll20l . All earlier discussion and methods can be repeated verbatim for the problem 
of approximating sample state-covariances. 

VI. Appendix A: proof for the Proposition [T] 

The KL divergence between a zero-mean normal distribution p with covariance T > and a perturbation with 
covariance T + eA is 

t^KL^I \Pe) = \ (log det(r + eA) - log det(r) + trace ((T + eA)-^T) - n) . 

Define Ay = T-i/^AT-^s, then 

dYi.{p\\v.) = \ (logdet (t^I'^{1 + eAT)Ti/^) - logdet(T) + trace (t~^I'^{I + t^TY^T~^l'^T^ - n) 

= ^ (log det(/ + eAr) + trace(/ + eA^)"^ - n) . (19) 

We expand (/ + eA^)"^ into the Taylor series 

(/ + eAr)-^ = / - eAr + e^A^^ - e^Ay^ + • • • . (20) 
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Let Xi, i = 1, represent eigenvalues of At, then 

n 

log det(/ + eAr) = log(l + eA,) 
1=1 



i=l 



etrace(Ar) - -e^tmceiAx'^) + -e^trace(AT^) H • (21) 

2 3 

We substitute ^ and ([IB into (O to obtain 

dKL{p\\Pe) = ^ehmce{AT^) + 0{e^). 

By a similar computation, one can easily see that dKL{Pe\\p) gives rise to the same metric, though the coefficients 
of higher order terms on e are different from those corresponding to d^{p\\p^). 

To draw a connection with the Fisher metric, we substitute 6 = Pe — p into the Fisher metric: 

gp,F,sher(')) - ^J^^ (2vr)"/2 det(T + eAf - ^ 
Since HeArjliT' < 1, (? At^ < I and hence 

-/ < eAr < I. 

Multiplying by T^^^ from left and right on all sides of the above inequality, we obtain 

-T < eA < r, 



or equivalently 
It follows that 

or equivalently, 
Consequently 



< -T + -eA < T. 
2 2 



-{^T + ^eAr'<-T-\ 



2(r + eA)~^ -r~^ > 0. 



i g-iy'(2(r+eA)-i-r-i)j/ 



(27r)'^/2 det (2(r + eA)-i - r-i)"^/^ 

is a Gaussian distribution with mean and covariance (2(T + eA)~^ — T~^)^^. Since the integral of a Gaussian 
distribution is 1, we obtain that 

det(r)V2 \ 

gp,F.heAOJ 1^^^^ ^^^^ ^ _ ^^^^^ ^ . 

But 

(T + eA)-i = T-^/2(j ^ eAry^T-^/^, 

and 

2(r + eA)-i - = (^2(/ + eAy)"^ - /) T'^/^^ 

Consequently, 

det (2(r + eA)-i - T^^)^^^ det{T + eA) = det ((T + eA) {2{T + eA)-^ - T-^) (T + eA))^/^ 

= det (t^/\I + eAr) (2(7 + eA^)"^ - l) {I + eAT)T^/^'^ ^'^ 
= det(T)V2det(/-e2AT2)i/2, 
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and 

gp,K.sher,„^(5) = (det(/ - e'Ar'r'^' - l) = (det(/ + e'Ar' + e'Ar' + ■■■ f'^ - l) • 
Once again considering the eigenvalues of A^- we get 

/ n oo \ 1/2 

det(/ + ^At" + e^r^ + • • • = "Wi^ieX^r) 



1/2 



\k=\ i=0 / 
n 

k=\ k<l 
n 

where in the last equality we have used the fact that XI = trace(Ar^) = ||Ar|||,. Therefore, 

k=l 

gp,Ksher('5) = ^gT,Rao(A) + O(e^). 

VII. Appendix B 

We now show that given two n x n matrices T > and T > 0, 

' T S ' 



arg irdn trace(T + T — S — S') 



S' T 



> 



has indeed the explicit closed-form expression 

So = T-2{T2TT2)tT2, (22) 
Consider the Shur complement 

P:=T- Sf~^S' 

which is clearly nonnegative definite. Then, ST 2 = [T — Pj^U, where UU = I, and 

S = {T - P)2Uf2. (23) 

Moreover, 

trace(5) = trace((r - P)-2Uf^) = trace(f ^(T - P)^-U). (24) 

Since T and T are given, minimizing tracefT + T — S — S') is the same as maximizing tracefS). Let C/sAsVZ. 
be the singular value decomposition of T^{T — P)^, and 

Uo := arg max {trace(f2(r - P)iC/) | UU' = I}. 
Then, Uq must satisfy VI^Uq = U'g and 

r^(r-p)3j7o = (T3(r-p)f^)i (25) 

From dlU) we have trace(5) = trace((r^(T - P)f^)^). Since P > 0, the trace(S') is maxunal when P = 0. 
Moreover, if P = 0, 

rank Q J | 
and f = SqT~^So. Thus, setting P = into we have 



< rank(T), 



1^ 1,^1 ^1,1 



Uo = T~2T~2(T2TT2)2. 

and consequently 5o = T-^{T'iTT'i)'iTi 



1,^1 
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